Pericytes are a type of cell that enwraps small blood vessels
throughout all vascularized tissues of the body. Pericytes are thought
to be a key effector cell that modulate blood vessel behavior and can
profoundly influence clinical outcomes in chronic disease. However,
pericytes are a difficult cell type to study because they lack exclusive
cellular markers, making identification and isolation difficult.
Currently, pericytes are identified by a panel of markers, their cell
shape, and their perivascular position relative to capillaries
(Fig 1). Recently, a study by the Yuan Lab [1] at
Harvard Medical School sought to identify novel pericyte-specific
markers using single cell RNA-seq from the Tabula Muris Atlas. The
Tabula Muris atlas is a compendium of single cell transcriptome data
from the model organism Mus musculus, containing nearly 100,000 cells
from 20 organs and tissues [2].
The authors used UMAP to visualize gene expression from
clustering with a KNN graph based on Euclidean distance generated with
linear dimension reduction with PCA. Pericyte associated clusters were
identified with high co-expression of two pericyte markers, CSPG4 and
PDGFRB. A subclass of cells in this group with exceptionally high
co-expression were identified as “stringent” pericytes and examined
further. Differential gene expression analysis of stringent pericytes
against all other cell types yielded a panel of novel candidate pericyte
markers. While this study identified interesting new potential pericyte
markers for specific research, there are potential issues with the
methodology used for the processing pipeline that will yield a better
understanding of the results:
These queries motivate the following research
questions:
1. Will the same overall trends in clustering still appear if the
datasets are not integrated across tissue types?
2. Does the clustering approach used in the study effectively delineate
between the manually annotated cell types in the database?
3. Do the pericyte designated clusters align with the manual annotations
that already exist in the database?
Primary Paper and Dataset References
[1] S.-H. Baek, E. Maiorino, H. Kim, K. Glass, B. A. Raby, K. Yuan,
Single Cell Transcriptomic Analysis Reveals Organ Specific Pericyte
Markers and Identities. Frontiers in Cardiovascular Medicine. 9 (2022)
(available at https://www.frontiersin.org/articles/10.3389/fcvm.2022.876591).
[2] The Tabula Muris Consortium., Overall coordination., Logistical
coordination. et al. Single-cell transcriptomics of 20 mouse organs
creates a Tabula Muris. Nature 562, 367–372 (2018). https://doi.org/10.1038/s41586-018-0590-4.
Primary Packages: Seurat, AnnotationHub, ggplot,
tidyverse, pheatmap.
Supporting Packages: cowplot, RColorBrewer,
BiocManager, stringr
>> Code for loading packages and initializing parallel processing.
# Note: this code assumes that the working directory is set to the source file
# location
# Install required Base Packages
base_packages <- c("Seurat", "ggplot2", "tidyverse", "cowplot", "RColorBrewer",
"BiocManager", "stringr", "pheatmap", "future", "rmdformats",
"memuse")
install.packages(setdiff(base_packages, rownames(installed.packages())))
# Install required Bioconductor Packages
biocm_packages <- c("AnnotationHub", "ensembldb")
bioc_installs <- setdiff(biocm_packages, rownames(installed.packages()))
if (length(bioc_installs)) {BiocManager::install(bioc_installs) }
# Load base packages
lapply(base_packages, library, character.only = TRUE)
# Load Bioconductor packages packages
lapply(biocm_packages, library, character.only = TRUE)
# Setup multicore processing
plan(multisession, workers = max(c(parallelly::availableCores()-2,1)))
# Set max ram to 90% of available
options(future.globals.maxSize = as.numeric(memuse::Sys.meminfo()[[2]]) *.9)For the purposes of clarity and reproducibility, below are summaries of the processing pipeline used in this project, along with an explanation of the gene annotation database and supporting files necessary to carry out the analysis.
Below are function parameters that were used in the reference
publication for processing. 1. NormalizeData(normalization.method =
“LogNormalize”, scale.factor = 10000)
2. FindVariableFeatures(selection.method = “vst”, nfeatures =
2000)
3. ScaleData(model.use = “linear”,)
4. RunPCA()
5. Dimensionality: Bladder: 30, Kidney: 45, Lung and Heart: 45 5.
FindNeighbors(dims = 1:dimensionality)
6. FindClusters(resolution = 0.5, algorithm=1)
7. RunUMAP()
Seurat single cell datasets typically record gene expression by
their ensemble ID which use unique static codes to identify features in
the genome. However, there are not human readable, so we must convert
ensemble IDs to the gene name with annotated gene database. Pipelines
typically use bioMart or AnnotationHub as a reference database.
AnnotationHub was used because it maintains a 1:1 mapping from ensemble
ID’s to gene names, while bioMart does not because it has a one to many
relationship between ensemble IDs and entrez IDs.
Code for annotating genes with gene name.
# Check for gene annotation file
if (!file.exists(paste0(getwd(),"/data/annotations_Mm.csv"))) {
# Connect to Annotation Hub as the source for gene annotations
ah <- AnnotationHub()
# Access the Ensemble database for organism
ahDb <- query(ah, pattern = c("Mus musculus", "EnsDb"), ignore.case = TRUE)
# Acquire the latest annotation files
id <- ahDb %>% mcols() %>% rownames() %>% tail(n = 1)
# Download the appropriate Ensembldb database
edb <- ah[[id]]
# Extract gene-level information from database
annotations <- genes(edb, return.type = "data.frame")
# Select annotations of interest
annotations <- annotations %>%
dplyr::select(gene_id, gene_name, seq_name, gene_biotype, description)
# Export table to csv
data.table::fwrite(annotations, file = paste0(getwd(),"/data/annotations_Mm.csv"))
}
# Load ensembl <-> gene annotation conversion table
gene_annotations <- data.table::fread(file = paste0(getwd(),"/data/annotations_Mm.csv"),
header = TRUE)Seurat datafiles provided by the Tabula Muris were used in this analysis for bladder, heart, kindey, and lung tissues.
Direct Link to Seurat datafiles found on Figshare.
Data sets were renamed to [tissue].rds and saved in the /data/ subfolder of the project repository.
bladder.rds
348 MB, MD5 checksum: 12349521850bed75aaba9a4abb5daf48
heart.rds
81 MB, MD5 checksum: f4740230d092fe5c072a3eacb5f533e6
kidney.rds
378 MB, MD5 checksum: afae3d5895401126c51090f5d6f9299a
lung.rds
752 MB,MD5 checksum: f6a12d88b19e7ecd94256c0f0f0ed284
>> Code for processing Seurat pipeline.
# Get dataset filenames, assumed to be named by tissue type
data_filenames <- str_replace(list.files(paste0(getwd(),'/data/'),pattern='rds',
full.names = F), '.rds','')[c(4,2,3,1)]
# Specify dimensions for linear dimensional reduction (authors used in paper)
dim_thresh <- c(30,35,45,35)
# SPecify clusters of interest for each tissue
clusters_oi <- list(13,c(11,10),19,24)
# Load all Seurat objects into list
seurat_objs <- list()
# Process and cluster each tissue data set individually (do not integrate)
if (!file.exists(paste0(getwd(),"/results/seurat_objs.rdata"))) {
for(x in seq_along(data_filenames)){
# Load seurat object
seurat_obj <- readRDS(paste0(getwd(),'/data/', data_filenames[x], '.rds'))
# Create cleaned cell type labels
clean_labels <- function(x)
str_replace_all(str_replace_all(x, paste0(data_filenames[x]," "),""),
c("cell "="", "^nan$" = "unknown"))
seurat_obj@meta.data$cell_label <-
factor(str_to_title(clean_labels(seurat_obj@meta.data$free_annotation)))
# Normalize expression by fraction of total, * 10,000, and log-transform
seurat_objs[[x]] <- NormalizeData(object = seurat_obj)
# Calculates z-score of gene expression dispersion within dataset (binned by their dispersion)
seurat_objs[[x]] <- FindVariableFeatures(object = seurat_objs[[x]])
# Eliminate uninteresting sources of variation (batch effects, technical noise, cell cycle)
# By regressing out these signals with linear models and outputting the residuals
seurat_objs[[x]] <- ScaleData(object = seurat_objs[[x]])
# Perform PCA on residuals to compress dataset and reduce noise from inconsequential genes
seurat_objs[[x]] <- RunPCA(object = seurat_objs[[x]])
# Automated alternative for determining number of PCs to include
# JackStrawPlot(object = seurat_objs[[x]], PCs = 1:50)
# Construct KNN graph refined with Jaccard similarity between local neighborhoods
seurat_objs[[x]] <- FindNeighbors(object = seurat_objs[[x]], dims = 1:dim_thresh[x])
# Cluster cells with Louvain algorithm (default) or SLM
seurat_objs[[x]] <- FindClusters(object = seurat_objs[[x]], resolution = 0.5)
# Run non-linear dimension reduction to reveal underlying manifold of the data
seurat_objs[[x]] <- RunUMAP(object = seurat_objs[[x]], dims = 1:dim_thresh[x])
}
# Export seurat object
names(seurat_objs) <- data_filenames
save(seurat_objs, file = paste0(getwd(),"/results/seurat_objs.rdata"))
} else {
# If seurat datafiles already computed, load from disk
load(paste0(getwd(),"/results/seurat_objs.rdata"))
}>> Code for visualizations.
# Visualizations of dimensions, clusters, and gene expression
# Load all Seurat objects into lists
gg_elbows <- list()
gg_umaps <- list()
gg_dots <- list()
gg_cows <- list()
gg_annot_umap = list()
gg_annot = list()
gg_annot_table= list()
# Visualizations
for(x in seq_along(data_filenames)){
# Convert cell type labels
remove_list = c("Epcam","Pecam", data_filenames[x], "cell")
#
clean_labels = function(k) str_replace_all(str_replace_all(k, paste(remove_list,collapse="|"),""),
c("\\s+"=" ", "^nan$" = "unknown"))
seurat_objs[[x]]@meta.data$cell_label <-
factor(str_to_title(str_trim(clean_labels(seurat_objs[[x]]@meta.data$free_annotation))))
# Determine number of PCs to use
gg_elbows[[x]] <- ElbowPlot(seurat_objs[[x]], ndims = 50, reduction = "pca") +
ggtitle(tools::toTitleCase(data_filenames[x]), ) +
geom_vline(xintercept = dim_thresh[x], color = "red") +
theme(plot.title = element_text(hjust = 0.5))
save_plot(paste0(getwd(),'/results/elbow_', data_filenames[x], '.png'),
gg_elbows[[x]], base_width = 3, base_height = 3)
# Visualize clusters
gg_temp <- DimPlot(seurat_objs[[x]], reduction = 'umap', label = FALSE, repel = TRUE) +
ggtitle(sprintf("%s: %s Total Cells", tools::toTitleCase(data_filenames[x]),
format(dim(seurat_objs[[x]])[2],big.mark=",", trim=TRUE))) +
theme_void() + theme(legend.position="none",plot.title = element_text(size=12,hjust = 0.5))
gg_umaps[[x]] <- LabelClusters(gg_temp, id = "ident", fontface = "bold", size = 6)
# Create a dot plot of expression of PC markers
gg_dots[[x]] <- DotPlot(seurat_objs[[x]], features = c("ENSMUSG00000032911","ENSMUSG00000024620")) +
scale_x_discrete(labels=c('CSPG4', 'PDFRB'),position = "top") + xlab("") + ylab("Cluster") +
scale_y_discrete(position = "right") + #scale_color_gradient(title) +
scale_size_area(breaks=c(25,50,75), limits = c(0, 75)) +
scale_color_gradient(breaks = rev(c(0,1,2)), limits = c(0, 2.5), low = "#D1CFD4", high = "#1E0BFE")+
guides(size=guide_legend(title="% Exp."), colour = guide_colourbar(title="Mean Exp.")) +
theme(text=element_text(size=12),plot.margin = unit(c(0, 0, 0, 0), "cm"),
axis.text.x = element_text(size =12,angle = 45, hjust =0), axis.text.y = element_text(size =12),
axis.ticks=element_blank(),panel.grid.major=element_blank(),
legend.position = "none",
panel.grid.minor = element_blank())
# Create grid of cluster and expresison table, then add legend at bottom
gg_cluster_expr <- plot_grid(gg_umaps[[x]], gg_dots[[x]], ncol = 2, rel_widths = c(3, 1))
# Add a second row with a legend
legend <- get_legend(gg_dots[[x]]+theme(legend.position = "bottom",legend.justification="center"))
gg_cows[[x]] <- plot_grid(gg_cluster_expr, legend, nrow = 2, rel_heights = c(1, .1))
save_plot(paste0(getwd(),'/results/umap_organ_', data_filenames[x], '.png'),
gg_cows[[x]], base_width = 8, base_height = 6)
# Visualize Cluster with cell type annotated by atlas
gg_annot_umap[[x]] <- DimPlot(seurat_objs[[x]], reduction = 'umap', group.by = "cell_label",
label = FALSE, repel = TRUE) +
ggtitle(sprintf("%s: %s Total Cells", tools::toTitleCase(data_filenames[x]),
format(dim(seurat_objs[[x]])[2],big.mark=",", trim=TRUE))) +
theme(legend.position="bottom",text = element_text(size = 9),
plot.title = element_text(size=12,hjust = 0.5), legend.key.size = unit(0, 'cm'),
legend.spacing = unit(0.1, 'cm'), legend.margin = unit(0.1, 'cm')) +
guides(color=guide_legend(nrow=c(8,3,5,3)[x], byrow=TRUE, override.aes = list(size=3)))
gg_annot_umap[[x]]
# Calculate percentage of cell type per cluster
tab <- table(Assigned=seurat_objs[[x]]@meta.data$cell_label, Clusters=seurat_objs[[x]]$seurat_clusters)
perc_tab <- sweep(tab, 2, colSums(tab ), FUN = '/')
gg_annot_table[[x]] <- pheatmap(perc_tab, color = colorRampPalette(c('white','blue'))(10),
cluster_rows = F, cluster_cols = F, fontsize = 9)
gg_annot[[x]] <- plot_grid(gg_annot_umap[[x]], gg_annot_table[[x]][[4]], nrow = 2, rel_widths = c(1, 1))
save_plot(paste0(getwd(),'/results/umap_annot_', data_filenames[x], '.png'),
gg_annot[[x]], base_width = 2*6, base_height = 2*6)
} A significance step in processing higher dimensional data is the linear reduction that compresses datapoints into a series of hyperplanes that comprise a linear combination of the features in the dataset. This data used the same dimensional thresholds that the authors used in the original study because the thresholds seemed reasonable as visualized with an elbow plot for each tissue.
Figure 2.4.1: Standard deviation captured by
increasing dimensions of linear reduction of the lung gene expression
data with PCA.Authors set the dimensionality for the bladder dataset
from 1 to 30 dimensions.
Figure 2.4.2: Standard deviation captured by
increasing dimensions of linear reduction of the heart gene expression
data with PCA.Authors set the dimensionality for the heart dataset from
1 to 35 dimensions.
Figure 2.4.3: Standard deviation captured by
increasing dimensions of linear reduction of the kidney gene expression
data with PCA.Authors set the dimensionality for the kidney dataset from
1 to 45 dimensions.
Figure 2.4.4: Standard deviation captured by
increasing dimensions of linear reduction of the bladder gene expression
data with PCA.Authors set the dimensionality for the lung dataset from 1
to 35 dimensions.
To investigate effect of leaving data unintegrated for pericyte marker analysis, the data was reprecossed in Seurat following the standard workflow (Fig 2.1). The same settings were used as the original study, including the dimensional chosen for each dataset (meaning the PCA principal components that were used for linear dimension reduction and subsequent nonlinear clustering, Fig 2.4.1-4). UMAP clustering was used to visualize the data with average expression of pericyte markers CPSG4 and PDGFRB visualized, along with a heatmap of the cell type composition of each cluster according to annotations from the Tabula Muris atlas.
Figure 3.2.1A: UMAP clustering of gene expression from cells isolated from lung issue, paired with average expression of pericyte markers CSPG4 and PDGFRB for each cluster. Similar to the results from the paper, the 156 cells comprising cluster 13 expressed both CSPG4 and PDGFRB at high levels and in a large fraction of cells (compared to 284 cells from cluster 16 in the study).
Figure 3.2.1B: Heatmap of cluster cell type composition using annotations from the Tabula Muris atlas.
Figure 3.2.2A: UMAP clustering of gene expression from cells isolated from heart issue, paired with average expression of pericyte markers CSPG4 and PDGFRB for each cluster. Similar to the results from the paper, the 394 cells comprising clusters 10 and 11 expressed both CSPG4 and PDGFRB at high levels and in a large fraction of cells (compared to 392 cells from clusters 10 and 12 in the study).
Figure 3.2.2B: Heatmap of cluster cell type composition using annotations from the Tabula Muris atlas.
Figure 3.2.3A: UMAP clustering of gene expression from cells isolated from kidney issue, paired with average expression of pericyte markers CSPG4 and PDGFRB for each cluster. Similar to the results from the paper, the 229 cells comprising cluster 19 expressed both CSPG4 and PDGFRB at high levels and in a large fraction of cells (compared to 358 cells from cluster 17 in the study).
Figure 3.2.3B: Heatmap of cluster cell type composition using annotations from the Tabula Muris atlas.
Figure 3.2.4A: UMAP clustering of gene expression from cells isolated from lung issue, paired with average expression of pericyte markers CSPG4 and PDGFRB for each cluster. Similar to the results from the paper, the 62 cells comprising cluster 24 expressed both CSPG4 and PDGFRB at high levels and in a large fraction of cells (compared to 345 cells from clusters 10 and 11 in the study).
Figure 3.2.4B: Heatmap of cluster cell type composition using annotations from the Tabula Muris atlas.
Overall, the clustering approach without data integration yielded similar pericyte like clusters as those found in the original study. However, for all tissues, there appeared to be a higher fraction of cells expressing both pericyte clusters with this modified clustering approach, suggesting that avoiding integration resulted in more homogenous and purified clusters. Compared to the original study, about 10% of cells in lung tissue expressed both markers (compared to 35%), 25% of cells in heart tissue (compared to 50%), 30% of cells in kidney tissue (compared to 50%), and 10% of cells in bladder (compared to 25%).
When compared to the cell type annotation provided by the Tabula Muris database, the unsupervised clustering performed reasonably well in separating them into distinct groups (Figure 3.2.[1-4]B). For lung tissue, clusters appeared to be split between cell types that have a high degree of similarity, such as T cell types or myeloid dendritic cells (Fig 3.2.1B). Similar trends were observed with heart tissue, although there were a multitude of cell types with unknown cellular origins for this tissue type (Fig 3.2.2B). For kidney tissue, overlap was mostly seen for immune cells and endothelial cells (Fig 3.2.3B). For bladder tissue, there was overlap with epithelial cell types, but otherwise most clusters did not map to multiple cell types (Fig 3.2.4B).
The cluster of cells that had high expression of CSPG4 and PDGFRB and deemed to be a source of pericytes did not always match their annotated cell type from the atlas. For lungs, the pericyte cluster were labeled as pericytes, but were labeled as smooth muscle cells in the heart, stroma mesangial cells in the kidney, and endothelial cells in the bladder. These discrepancies will require further follow up with the literature and more detailed marker expression of these groupings, and also highlights the difficulty in determining clean groupings for cell type.
Future Questions
1. Does that marker expression of the stringent pericyte cluster contain
various pericyte markers shown in the literature?
2. Are there cells with marker expression profiles that match pericytes
found in other clusters?
3. Will the same results be observed if the data is reprocessed with
recent advances in transcriptomic analysis (3)?